[core] Introduce Object Table to manage unstructured files #4459

Merged 6 commits on Nov 12, 2024

@JingsongLi (Contributor) commented Nov 5, 2024

Purpose

We plan to introduce an Object Table that provides a metadata index for unstructured data objects under a specified directory in object storage. Object tables allow users to analyze unstructured data in object storage:

  1. Use the Python API to manipulate the unstructured data, for example converting images to PDF format.
  2. Use model functions to perform inference, then join the results of these operations with other structured data in the Catalog.

The object table is managed by the Catalog, so it can also support access permissions and data lineage management.

The schema for Object Table is:

    RowType SCHEMA =
            RowType.builder()
                    .field("path", DataTypes.STRING().notNull())
                    .field("name", DataTypes.STRING().notNull())
                    .field("length", DataTypes.BIGINT().notNull())
                    .field("mtime", DataTypes.TIMESTAMP_LTZ_MILLIS())
                    .field("atime", DataTypes.TIMESTAMP_LTZ_MILLIS())
                    .field("owner", DataTypes.STRING().nullable())
                    .field("generation", DataTypes.INT().nullable())
                    .field("content_type", DataTypes.STRING().nullable())
                    .field("storage_class", DataTypes.STRING().nullable())
                    .field("md5_hash", DataTypes.STRING().nullable())
                    .field("metadata_mtime", DataTypes.TIMESTAMP_LTZ_MILLIS().nullable())
                    .field("metadata", DataTypes.MAP(DataTypes.STRING(), DataTypes.STRING()))
                    .build()
                    .notNull();
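
To make the columns concrete, here is a minimal, self-contained Java sketch of how one file's metadata could map onto a few of the schema fields. It is not Paimon code: the class `ObjectMetadataSketch` and its `describe` method are hypothetical, and a local file stands in for an object in object storage.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Paimon): maps one local file onto Object Table columns. */
final class ObjectMetadataSketch {

    /** Returns a column-name to value map for a single file, in schema order. */
    static Map<String, Object> describe(Path file) throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("path", file.toAbsolutePath().toString());      // non-null STRING
        row.put("name", file.getFileName().toString());         // non-null STRING
        row.put("length", attrs.size());                        // non-null BIGINT
        row.put("mtime", attrs.lastModifiedTime().toInstant()); // timestamp with local time zone
        row.put("atime", attrs.lastAccessTime().toInstant());   // timestamp with local time zone
        // owner, generation, content_type, storage_class, md5_hash and metadata_mtime
        // are nullable in the schema; this sketch leaves them out because a local
        // file has no direct equivalent for most of them.
        return row;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("object-table-sketch", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(describe(tmp));
        Files.delete(tmp);
    }
}
```

The real table is populated from the object store's own listing, so values such as `generation` or `storage_class` would come from that store rather than from file attributes.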

The SQL usage is:

-- Create Object Table

CREATE TABLE `my_object_table` WITH (
  'type' = 'object-table',
  'object-location' = 'oss://my_bucket/my_location'
);

-- Refresh Object Table

CALL sys.refresh_object_table('mydb.my_object_table');

-- Query Object Table

SELECT * FROM `my_object_table`;

-- Query Object Table with Time Travel

SELECT * FROM `my_object_table` /*+ OPTIONS('scan.snapshot-id' = '1') */;

Tests

ObjectTableITCase

API and Format

Documentation

     * @return A long value representing the time the file was last accessed, measured in
     *     milliseconds since the epoch (UTC January 1, 1970).
     */
    default long getAccessTime() {

I think getLatestAccessTime may be better?

Contributor Author: This follows the same naming as getModificationTime.

     *  CALL sys.refresh_object_table('tableId')
     * </code></pre>
     */
    public class RefreshObjectTableProcedure extends ProcedureBase {

Should we add this to the documentation?


    @Override
    public TableCommitImpl newCommit(String commitUser) {
        throw new UnsupportedOperationException("Object table does not support Write.");

Suggested message: "Object table does not support Commit."

    table =
            ObjectTable.builder()
                    .underlyingTable(table)
                    .objectLocation(options.objectLocation())

Should we check whether options.objectLocation() is null?
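
For illustration, a minimal sketch of such a fail-fast check. The class `ObjectLocationCheckSketch`, its builder, and the error message are hypothetical stand-ins. They are not Paimon's `ObjectTable` builder, and this is not necessarily how the PR resolved the comment.

```java
/** Hypothetical stand-in for a table builder with a required object location. */
final class ObjectLocationCheckSketch {

    private final String objectLocation;

    private ObjectLocationCheckSketch(String objectLocation) {
        this.objectLocation = objectLocation;
    }

    String objectLocation() {
        return objectLocation;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private String objectLocation;

        Builder objectLocation(String objectLocation) {
            this.objectLocation = objectLocation;
            return this;
        }

        ObjectLocationCheckSketch build() {
            // Fail at build time with a clear message instead of surfacing a
            // NullPointerException later, when the location is first used.
            if (objectLocation == null || objectLocation.trim().isEmpty()) {
                throw new IllegalArgumentException(
                        "Object table requires the 'object-location' option to be set.");
            }
            return new ObjectLocationCheckSketch(objectLocation);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                builder().objectLocation("oss://my_bucket/my_location").build().objectLocation());
        try {
            builder().build(); // no location set
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```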

@wwj6591812

“We plan to introduce an Object Table to provide metadata indexes for unstructured data objects in the specified Object Storage storage directory.”

Hi, Jingsong.
After looking through the PIPs, I found no relevant information about Paimon supporting unstructured data storage.
Could you describe Paimon's future plans for unstructured data storage? Our company is currently exploring how to use Paimon for unstructured data analysis.

@JingsongLi

In most cases, unstructured data is not stored in Paimon itself. What we want to do here is simply map these files, so that users can process the metadata of unstructured files and then decide whether they need to read the files themselves.

@yanbinyang

+1

@JingsongLi JingsongLi merged commit 4ccda6f into apache:master Nov 12, 2024
12 of 13 checks passed
3 participants